Language Identification Investigation

Author: Aaron Binns
Date: January 10, 2011

Summary

Evaluation of a Java library for identifying the human language of archival web page content, with analysis of test results and thoughts on problems and challenges.

Introduction

A very common request from web archive users is to identify the human language in which a web page is written. Individual end-users may simply wish to know the language of a page. Researchers may want to study only pages in a particular language. Web archivists may wish to harvest only pages written in particular languages, using language as a factor in determining whether a page falls inside or outside a domain harvest's parameters.

Although HTML for many years has supplied mechanisms for declaring the language of the page content, they have not been widely used and cannot be relied upon for large-scale language identification. However, if language metadata is provided, it makes sense to utilize it in some capacity.

Language vs. Encoding

A related and popular topic is "character encoding detection", which is not the same as language identification. Character encoding is the mapping of abstract characters to and from bytes. Language identification takes those character sequences and determines to which human language they belong.

In this investigation, we kept the encoding issues separate and assumed that bytes have been properly decoded into characters. The language identification system then operates on characters only.

In some cases, the encoding can be used to guide the language identification. Encodings like UTF-8 can encode all Unicode characters, but others are much narrower: KOI8-R, for example, handles only ASCII and Cyrillic and was originally intended for use with Russian and related languages. So, if we encounter a web page using KOI8-R, we could use that as a hint to the language identification system.
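The hint idea can be sketched as a simple lookup from a declared encoding to the languages it was designed for. This is only an illustration: the class and method names are hypothetical, and every table entry other than KOI8-R is an example added here rather than something examined in this investigation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: suggests candidate languages from a page's declared encoding.
class EncodingLanguageHint {
    private static final Map<String, List<String>> HINTS = new HashMap<String, List<String>>();
    static {
        HINTS.put("KOI8-R", Arrays.asList("ru"));    // from the text above
        HINTS.put("KOI8-U", Arrays.asList("uk"));    // illustrative additions below
        HINTS.put("SHIFT_JIS", Arrays.asList("ja"));
        HINTS.put("EUC-JP", Arrays.asList("ja"));
        HINTS.put("EUC-KR", Arrays.asList("ko"));
    }

    /** Returns ISO 639-1 codes suggested by the encoding, or an empty list if it says nothing. */
    static List<String> hintFor(String charsetName) {
        if (charsetName == null) {
            return Collections.emptyList();
        }
        List<String> hint = HINTS.get(charsetName.trim().toUpperCase(Locale.ROOT));
        return hint == null ? Collections.<String>emptyList() : hint;
    }
}
```

A universal encoding such as UTF-8 returns an empty list, so the hint would only ever narrow the candidates, never replace the identification itself.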

Identification vs. Detection

As a matter of terminology, it seems that "identification" and "detection" are used interchangeably. I prefer to use "identification" to avoid confusion with "encoding detection" described in the previous section. However, some of the open source projects examined in this report use "language detection" in the project name, and I will use that term when referring to those projects.

Implementation Survey

A few simple web searches for "language identification" or "language detection" reveal a number of implementations, both commercial and open source.

One of the more popular commercial solutions is from Basis Technology, who provide an entire suite of language processing and analysis tools.

In the open source world, it seems that TextCat pioneered the application of a statistical n-gram approach as described in Cavnar, W. B. and J. M. Trenkle, N-Gram-Based Text Categorization, Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994. TextCat spawned a number of derivative implementations in various programming languages.

Language Detection Library for Java

I chose to evaluate Language Detection Library for Java by Shuyo Nakatani.

The full details of the algorithm are described in the presentation linked on the project homepage.

To summarize, the blob of text is split into words, and the words are split into characters. Sub-sequences of characters are then matched against a library of subsequences with associated probabilities that the subsequence belongs to a particular language.

For example, "ion" and "thr" are common three-letter sequences in English; "äch" and "ünd" in German; and "şte" and "ată" in Romanian. The actual language profiles are much more elaborate and assign probabilities to each character sequence, so that even if a sequence is common in more than one language, the languages can still be distinguished when given a non-trivial input sequence.

The basic algorithm is to take the 2- or 3-character sequences of each word, find the probability that each sequence belongs to each language, and combine the probabilities as the text is analyzed. In the end, zero or more languages are identified with a confidence score for each.
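The general idea can be sketched in a few dozen lines. This is a toy illustration, not the langdetect library's actual implementation: the class name, the floor probability for unseen sequences, and the way the probabilities are combined (summing logarithms, then normalizing) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy n-gram language identifier; profiles map a character sequence to a probability.
class ToyNgramIdentifier {
    // Assumed probability for a sequence that a profile has never seen (made-up value).
    private static final double UNSEEN = 1e-4;

    private final Map<String, Map<String, Double>> profiles = new LinkedHashMap<>();

    void addProfile(String lang, Map<String, Double> ngramProbabilities) {
        profiles.put(lang, ngramProbabilities);
    }

    /** Splits text into words and returns every 2- and 3-character sequence of each word. */
    static List<String> ngrams(String text) {
        List<String> result = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            for (int n = 2; n <= 3; n++) {
                for (int i = 0; i + n <= word.length(); i++) {
                    result.add(word.substring(i, i + n));
                }
            }
        }
        return result;
    }

    /** Returns one score per language; the scores are normalized to sum to 1. */
    Map<String, Double> identify(String text) {
        List<String> grams = ngrams(text);
        Map<String, Double> logScores = new LinkedHashMap<>();
        double max = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> profile : profiles.entrySet()) {
            double logScore = 0.0;
            for (String gram : grams) {
                Double p = profile.getValue().get(gram);
                logScore += Math.log(p == null ? UNSEEN : p);
            }
            logScores.put(profile.getKey(), logScore);
            max = Math.max(max, logScore);
        }
        // Subtract the maximum before exponentiating to avoid numeric underflow.
        double total = 0.0;
        for (double s : logScores.values()) {
            total += Math.exp(s - max);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : logScores.entrySet()) {
            result.put(e.getKey(), Math.exp(e.getValue() - max) / total);
        }
        return result;
    }
}
```

With an English profile containing "ion" and "thr" and a German profile containing "äch" and "ünd", a call such as identify("nation three") scores English higher, because only the English profile recognizes any of the input's sequences.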

In many cases, a single language is identified with a high confidence, e.g. 99%. In some cases, multiple languages are identified, with decreasing confidence scores. And sometimes, the algorithm fails to identify any language at all.

Code

The code can be downloaded from the Google Code project page:

http://code.google.com/p/language-detection/

You need two things:

which are all bundled in the langdetect download.

A simple command-line driver is provided:

java -jar lib/langdetect.jar --detectlang -d profiles <files>

which produces JSON output of the form:

<filename>:[<lang>:<confidence>]

with multiple language-confidence pairs separated by commas. For example:

foo.txt:[en:0.9999955743341075]
bar.txt:[fr:0.9999969857118844]
baz.txt:[en:0.714284495874296, fr:0.2857154992369151]
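For anyone post-processing the driver's output, a line in the format shown above can be pulled apart with a little string handling. The sketch below is a hypothetical helper, not part of the langdetect download. It assumes one result per line and that a result with no identified language is printed as an empty pair of brackets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for lines of the form <filename>:[<lang>:<confidence>, ...]
class DetectOutputParser {
    /** Returns the part of the line before the bracketed score list. */
    static String parseFilename(String line) {
        int open = line.lastIndexOf(":[");
        if (open < 0) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        return line.substring(0, open);
    }

    /** Returns language -> confidence in the order the pairs appear on the line. */
    static Map<String, Double> parseScores(String line) {
        int open = line.lastIndexOf(":[");
        int close = line.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        String body = line.substring(open + 2, close).trim();
        if (body.isEmpty()) {
            return scores; // assumed form when no language was identified
        }
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split(":");
            scores.put(parts[0], Double.parseDouble(parts[1]));
        }
        return scores;
    }
}
```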

Since our main concern is to identify the language of web content, the simple command-line driver isn't directly applicable. What we want is to pass the text of a web page to the language identification library and get back a list of language+confidence scores.

Since my pet project The JBs already provides the framework for running a Hadoop MapReduce job on NutchWAX segments (which contain the extracted webpage text), I hacked together some code to run the language identification inside a MapReduce job.

Integration

On the whole, integrating the langdetect library into The JBs and MapReduce was fairly straightforward. I chose to perform the language identification during the map stage, emitting the web page unique key (URL+hash) along with the JSON-format string listing the language+confidence pairs.

In simplified form, removing some exception handling and other syntactic sugar, the code looks like:

public void map( Text key, Writable value, OutputCollector output, Reporter reporter )
  throws IOException
{
  // Create a detector for this page.
  Detector detector = DetectorFactory.create( 0.5 );

  // Feed it the page's extracted text.
  detector.append( value.toString() );

  // Emit the page key with its language+confidence pairs.
  output.collect( key, new Text( detector.getProbabilities( ).toString() ) );
}

The 0.5 value is taken from the project's example code.

There were two problems integrating the langdetect library into the Hadoop framework. Both had to do with assumptions in the library which were valid in a stand-alone program but failed in the Hadoop distributed environment.

Static Initialization

The first problem had to do with the fact that the langdetect library required one-time initialization. In a stand-alone program running in a single Java VM, this is easily achieved in many ways: static initializer, in the main() method, etc.

However, when running in Hadoop, one has to handle both the local and distributed contexts. If Hadoop is running on a local machine, then a single JVM is used; whereas in distributed mode multiple JVMs are spawned across many nodes for the various Map and Reduce tasks. In all cases the library must be initialized for each JVM.

To complicate matters, the library prohibits multiple initializations. If the initialization routine is called more than one time in a single JVM, the library throws an exception.

Ultimately, these problems were all solved by putting the initialization code into the Mapper and Reducer constructors then catching and ignoring any exceptions thrown due to multiple initialization. This handles both the local single-node-single-JVM case as well as the distributed environment.
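The pattern looks roughly like the sketch below. Everything in it is a stand-in so that it runs without Hadoop or langdetect on the classpath: FakeLibrary plays the part of a library whose setup routine rejects a second call, and Task plays the part of a Mapper or Reducer. In the real code, the call inside the constructor would be the library's own initialization routine and the caught type would be whatever exception it throws.

```java
// Sketch of once-per-JVM initialization that tolerates repeated attempts.
class InitOncePerJvm {
    /** Stand-in for a library that throws if it is initialized twice (hypothetical). */
    static class FakeLibrary {
        private static boolean loaded = false;

        static synchronized void loadProfiles() {
            if (loaded) {
                throw new IllegalStateException("profiles already loaded");
            }
            loaded = true;
        }

        static synchronized boolean isLoaded() {
            return loaded;
        }
    }

    /** Stand-in for a Mapper or Reducer: initializes in its constructor. */
    static class Task {
        Task() {
            try {
                FakeLibrary.loadProfiles();
            } catch (IllegalStateException alreadyInitialized) {
                // Another task in this JVM got there first; the library is ready to use.
            }
        }
    }
}
```

Constructing any number of Task objects in one JVM leaves the library initialized exactly once, which covers both the local single-JVM case and the one-JVM-per-task distributed case.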

Directory of Profiles

The other problem only occurred in the Hadoop distributed environment, and is nearly the same as a problem I already encountered in NutchWAX. The langdetect library wants to open its profiles/ directory as an on-disk directory and iterate through the files contained therein.

This works just fine in the local single-node-single-JVM mode because the jar file containing our Hadoop MapReduce code is expanded in a directory in ${HADOOP_TMP} with the JVM's current working directory set to it. Thus, the langdetect code can open the on-disk profiles/ directory and iterate through the files.

However, in Hadoop 0.20 distributed mode, the jar file is sent over the wire to the TaskTrackers to run the MapReduce code, but it is not expanded into a temporary directory. When the langdetect library tries to open the directory, it fails.

I ran into essentially the same problem in NutchWAX where the code wanted to iterate through a directory listing of plugins. The work-around contains two parts

Nasty little hack, but it does the job.

Tests

I ran tests on sample web content from three different collections with different language profiles. The results of each are outlined below.

U.S. Congressional Websites

The first test was on a small set of WARC files from a Library of Congress collection. I keep these on my local machine for quick testing of new ideas and experimental code. They are primarily pages from U.S. congressional websites, such as senators' homepages. I expected that they would all (or nearly all) be in English.

The langdetect library identified most as English, and although I did not examine every result, I assume they are correct. It also detected a small number of pages in Spanish, which was also correct, as many U.S. senators and representatives have pages targeted to their Spanish-speaking constituents.

A small number of pages were (mis)identified as a seemingly random language; but upon further examination it turned out that these pages almost always contained primarily computer source code, such as JavaScript or CSS. An obvious improvement to my test code is to exclude such pages from analysis.

Biblioteca Nacional de España

I then ran a test of a large sample of web pages from the 2010 domain harvests IA performed for Biblioteca Nacional de España. I expected a large portion of the pages to be identified as Spanish.

Overall, the aggregate results confirmed my expectations. Most of the pages were identified as Spanish, with English being the second most frequent. It also identified many pages as Catalan; I examined a few of these and they appeared to be correctly identified.

What was strange was the large number of pages identified as being in languages which are rather rare on the web, such as Guaraní.

Library of Congress Iraq War Collection

A test on a 5000-arcfile sample of the LoC Iraq War Collection exhibited aggregate results similar to those of the BNE collection. The top 10 are as follows:

ISO 639-1  Language     # Pages
ar         Arabic       1,214,679
az         Azerbaijani    252,734
en         English        135,899
gn         Guaraní         34,814
da         Danish          30,167
my         Burmese         11,840
fo         Faroese          8,880
is         Icelandic        4,931
fr         French           2,327
hr         Croatian         1,117

Although, as one would expect, Arabic and English are in the top three, it's rather strange that Azerbaijani, Guaraní and Danish would be the others in the top five.

Looking at pages identified as Azerbaijani, it seems that the pages in the Iraq War collection trigger a common failure-mode in the use of the langdetect library: pages with multiple languages.

I examined a dozen or so pages identified as Azerbaijani and they all seemed to have a mixture of Arabic and English content. One illustrative example was a homepage on blogger.com where the page template was in English, but the blogger's name, description and so forth were in Arabic.

I suspect that when the langdetect library is given a page with multiple languages as a single input, it will often misidentify it as an otherwise rarely occurring language.

Analysis

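As mentioned above, analyzing the test results immediately reveals two issues: pages that consist primarily of computer source code, such as JavaScript or CSS, and pages that contain more than one language.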

The first is easily addressed by looking at the MIME type of the page, or perhaps even the filename. In general we omit these types when performing language analysis on web content, and failed to do so here only because of programmer haste to get the test code running.
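A filter of that kind might look like the following sketch. The class and method names are hypothetical, and the lists of accepted MIME types and rejected filename extensions are illustrative assumptions rather than the lists used in our processing system.

```java
import java.util.Locale;

// Hypothetical pre-filter: decides whether a page is worth sending to language identification.
class TextMimeFilter {
    /** Returns true only for MIME types likely to carry human-readable prose. */
    static boolean isProseMimeType(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html")
            || mime.equals("application/xhtml+xml")
            || mime.equals("text/plain");
    }

    /** Returns true if the filename suggests source code served under a generic type. */
    static boolean looksLikeSourceFile(String filename) {
        if (filename == null) {
            return false;
        }
        String lower = filename.toLowerCase(Locale.ROOT);
        return lower.endsWith(".js") || lower.endsWith(".css");
    }

    static boolean shouldAnalyze(String contentType, String filename) {
        return isProseMimeType(contentType) && !looksLikeSourceFile(filename);
    }
}
```

The filename check is there because a server may label a script or stylesheet as text/plain, in which case the MIME type alone would let it through.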

Multiple Languages

A much more complex issue is multiple languages on a single page. It immediately calls into question the very notion of assigning a single "language" to a page. Consider the blogger.com example from the previous section. Should that page be identified as solely Arabic or solely English? Wouldn't it be more useful to state that it contains content in both languages?

We should rephrase the original question, not asking "Which language is the page?" but rather "Which languages are used on this page?"

Ideally, we would use the HTML markup structure to separate the page into chunks and analyze each one. In our current processing system, however, the page's plain text is extracted and all the HTML markup is discarded. This is something that we could revisit.

Multiple Scripts

When analyzing the results from the Iraq War collection test, I was puzzled by the large number of pages identified as Azerbaijani. Then I discovered that Azerbaijani can be written in both Arabic and Latin scripts. For example:

Script  Text
Arabic  آذربایجان دا انسان حاقلاری ائوی آچیلاجاق ب م ت ائلچيسي برمه موخاليفتي نين ليدئري ايله گؤروشه بيليب ترس شوونيسم فارس از آزادي ملتهاي تورکمن
Latin   a az qalıb breyn rinq intellektual oyunu üzrə yarışın zona mərhələləri keçirilib miq un qalıqlarının dənizdən çıxarılması davam edir məhəmməd peyğəmbərin karikaturalarını çap edən qəzetin baş redaktoru iş otağında ölüb

Reconsidering the statistical n-gram algorithm, I hypothesize that on a page of mixed Arabic and English, the Arabic part matches the Arabic-script Azerbaijani n-grams and the English part matches the Latin-script Azerbaijani n-grams. Since the Arabic profile won't match the English parts, and the English profile won't match the Arabic parts, Azerbaijani is the one language that matches both.

Azerbaijani is not the only language which utilizes more than one script; there are many. For example, Bosnian, Serbian, Turkmen, Tatar, Uyghur, and Uzbek all use both Latin and Cyrillic scripts; Kazakh and Kyrgyz use all three: Latin, Cyrillic and Arabic.

Since scripts are easily distinguished in Unicode, an obvious tactic to try is to first separate the words by script, then analyze each group separately. So for the Iraq War pages with both Arabic and English, we would separate the Arabic parts from the Latin parts and analyze each in turn. This would produce a multi-valued result (that the page has both Arabic and English) rather than forcing the library to make a single, erroneous choice for the entire page.
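A minimal sketch of the separation step follows. It is untested against the collections above and rests on two assumptions: that words are delimited by whitespace, and that a word's script can be taken from its first letter. It uses Character.UnicodeScript, which was added in Java 7, so it will not compile on Java 6 (see the addendum at the end of this report).

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Sketch: group the words of a page by Unicode script before language identification.
class ScriptSplitter {
    /** Returns, for each script found, the words in that script joined by single spaces. */
    static Map<UnicodeScript, String> splitByScript(String text) {
        Map<UnicodeScript, String> byScript = new EnumMap<>(UnicodeScript.class);
        for (String word : text.split("\\s+")) {
            UnicodeScript script = scriptOfFirstLetter(word);
            if (script == null) {
                continue; // numbers, punctuation, or an empty token
            }
            String soFar = byScript.get(script);
            byScript.put(script, soFar == null ? word : soFar + " " + word);
        }
        return byScript;
    }

    /** Returns the script of the first letter in the word, or null if it has no letters. */
    static UnicodeScript scriptOfFirstLetter(String word) {
        for (int i = 0; i < word.length(); ) {
            int codePoint = word.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                return UnicodeScript.of(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return null;
    }
}
```

Each value in the returned map could then be passed to the language identification library as its own input, giving one result per script instead of one per page.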

Chromium

After pondering the test results, I took a look at Chromium, the open source project behind the Google Chrome web browser. Chrome has a built-in language identification feature and I was curious if it was in the open source project. Fortunately, it is:

chromium/src/third_party/cld/

contains the "compact language detection library".

Reading through the code, it appears that they use a similar algorithm based on statistical n-grams with profiles for each language. They also incorporate many of the speculations in the previous sections as well as other enhancements:

The bulk of the code is at

chromium/src/third_party/cld/encodings/compact_lang_det/compact_lang_det_impl.cc

Conclusions

As outlined in the sections above, a few tentative conclusions can be drawn:

One thing is for sure: language identification is not simple and is not "fire and forget". Achieving a language identification capability suitable for web archives will require further experimentation and refinement.

Addendum: Java 6 Unicode Support

I recently came across this note on StackOverflow.com, which points out that Java 6 only supports Unicode 4.0, whereas the current (June 2011) version of Unicode is 6.0:

Obtaining unicode characters of a language in Java

Although it probably doesn't make much difference to people working with European languages, I believe post-4.0 Unicode greatly improves support for Asian languages and bi-directional handling, and fixes many errata. The aspect of most relevance to this paper is the use of Unicode script properties.

In Java, you can use Unicode categories and blocks to easily identify character properties. For example, the category "Lu" is all upper-case letters in all languages; and "InGreek" is the block of Greek characters. These properties are accessible in regular expressions too, allowing a regex to get all upper-case letters with:

Pattern p = Pattern.compile( "\\p{Lu}" );

Now, the post on StackOverflow points out that using Unicode blocks to identify the script of a character is wrong because some characters are spread over different language blocks and sometimes blocks contain characters from another script. That means that using:

Pattern p = Pattern.compile( "\\p{InGreek}" );

will also match the 18 characters that are in the Greek block but are not actually Greek characters. Yikes.

A reliable way to identify the script is to use the Unicode 6.0 "script" property. The problem is that Java 6 supports only Unicode 4.0 and therefore does not have the script property for Unicode characters. The forthcoming Java 7 is supposed to have full Unicode 6.0 support.

As a work-around, the StackOverflow poster provides links to some Perl programs which can be used to query the Unicode 6.0 character database, including the script property. With Perl (and Unicode 6.0) one can use a regular expression '\p{Script=Greek}' to match all Greek characters.

So, if we want to use Java to determine to which script a character belongs, e.g. is it Greek, Arabic, Cyrillic, Latin, etc., we cannot get an entirely accurate answer with Java 6.

A potential work-around might be to use the Perl scripts to generate tables of Unicode code-points for each script of interest (Greek, Arabic, etc.), then code up our own Java routines to use those tables to identify script membership for characters.
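The Java side of such a work-around could be as small as the sketch below. It assumes the generated tables are plain text with one entry per line in the "XXXX..YYYY ; Script" layout used by the Scripts.txt file of the Unicode Character Database, with "#" starting a comment and non-overlapping ranges. The class and its methods are hypothetical, and the lookup relies only on classes available in Java 6.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: script lookup driven by an externally generated table of code-point ranges.
class ScriptTable {
    private static final class Range {
        final int end;
        final String script;

        Range(int end, String script) {
            this.end = end;
            this.script = script;
        }
    }

    // Keyed by the first code point of each range.
    private final TreeMap<Integer, Range> ranges = new TreeMap<Integer, Range>();

    /** Adds one table line such as "0370..0373 ; Greek" or "0374 ; Common # comment". */
    void addLine(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return; // blank or comment-only line
        }
        String[] fields = data.split(";");
        String[] bounds = fields[0].trim().split("\\.\\.");
        int start = Integer.parseInt(bounds[0].trim(), 16);
        int end = bounds.length > 1 ? Integer.parseInt(bounds[1].trim(), 16) : start;
        ranges.put(start, new Range(end, fields[1].trim()));
    }

    /** Returns the script name for a code point, or "Unknown" if no range covers it. */
    String scriptOf(int codePoint) {
        Map.Entry<Integer, Range> entry = ranges.floorEntry(codePoint);
        if (entry != null && codePoint <= entry.getValue().end) {
            return entry.getValue().script;
        }
        return "Unknown";
    }
}
```

Loaded with the two example lines from the addLine comment, scriptOf(0x0371) reports Greek while scriptOf(0x0374), which sits in the Greek block, reports Common. That is exactly the block-versus-script distinction the StackOverflow post warns about.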